perf(vectors): batch per-file embeddings + offload parse/enrich#334
Merged
Conversation
process_*_file embedded each chunk serially (one `await embedder.embed()` per chunk in a for-loop), so the batched SentenceTransformerEmbedder (max_batch_size=64) only ever saw batches of 1. This wasted its batching within a file and hurt `increment` most: few changed files means little cross-file concurrency, so embeddings ran one-at-a-time. #1 — embed all chunks of a file concurrently via asyncio.gather so the embedder groups them into one model.encode(...). Applied to java/sql/yaml. #2 — for java, move parse_java + per-chunk enrich_chunk off the event loop (asyncio.to_thread) so the loop keeps the embedder's batching queue fed while a file is being parsed. parse_java is serialized by _PARSE_LOCK because it reuses the lru-cached singleton tree-sitter Parser, whose parse() mutates state and is not thread-safe. splitter.split stays inline (shared Rust RecursiveSplitter singleton); enrich_chunk is pure-Python over the immutable AST (lru_cache reads are GIL-safe) and runs without the lock. Output is equivalent: same chunk texts → same normalized vectors; rows are still declared in chunk order; uuids stay random. No schema/ontology/env change. Because process_java_file is @coco.fn(memo=True), the changed body invalidates memo entries, so the first run after upgrade reprocesses (equivalent to a full reindex) — subsequent runs are incremental again. Verified: ruff clean; 838 light tests pass; heavy test_lancedb_e2e (real cocoindex update --full-reprocess + graph + search) passes. One heavy test (test_lancedb_ignore_file_reduces_indexed_java_files) fails, but it fails identically on clean master (FileExistsError in the test's own mkdir, unrelated to this change). Co-Authored-By: Claude <noreply@anthropic.com>
Code review of the vectors perf change found a race it introduced: _parse_and_enrich_java runs enrich_chunk outside _PARSE_LOCK on concurrent worker threads, and enrich_chunk -> collect_annotation_meta_chain (lru_cached, non-serializing) -> _collect_annotation_decl_index -> parse_java. On a cold cache, multiple threads hit parse_java concurrently on the shared @lru_cache tree-sitter Parser, whose parse() mutates state and is not thread-safe -> corrupt AST or native segfault. (Before the perf change, enrich ran on the single event-loop thread, so this was serialized.) Root-cause fix at the source: _parser() now returns a per-thread Parser (threading.local) instead of a process-wide lru_cached singleton. Each OS thread gets its own Parser; the Language is immutable and shared. This protects EVERY parse_java caller — the flow's direct parse, enrich's transitive parse, and build_ast_graph — and preserves parse parallelism rather than serializing it. Removes the now-redundant _PARSE_LOCK from the flow file. Added tests/test_ast_java_thread_safety.py: 16 threads × 60 parses must each match the single-threaded reference (locks the invariant in; a revert to a shared Parser would corrupt/crash here). Verified: ruff clean; 839 light tests pass (+1 new); heavy test_lancedb_e2e main path (full reprocess + graph + search) passes. The pre-existing test_lancedb_ignore_file_reduces_indexed_java_files failure is unchanged (its own mkdir bug, fails on master). Co-Authored-By: Claude <noreply@anthropic.com>
…hreads Second code-review round found a perf regression introduced by lever #2: making enrich_chunk concurrent meant the first wave of worker threads each cold-miss collect_annotation_meta_chain (lru_cached, non-serializing) and each redundantly walk+parse the ENTIRE project (~min(32,cpu+4)x redundant full-project parses at cold start, GIL-serialized into the critical path) — directly offsetting the embedding-batching win on large repos. Fix: warm collect_annotation_meta_chain + load_brownfield_overrides ONCE on the event-loop thread in app_main, before coco.mount_each fans files into worker threads. Key derivation mirrors enrich_chunk exactly so the warmed entries are the ones workers hit; with warming every worker gets a cache hit (lru_cache reads are thread-safe). Wrapped so a warm-up failure never breaks indexing (falls back to lazy per-thread misses = prior behavior). The same review round confirmed ZERO correctness races remain: the per-thread Parser covers every parse_java call site (incl. enrich's transitive parse), and an exhaustive walk of enrich_chunk's callees found no other non-thread-safe shared state. Verified: ruff clean; heavy test_lancedb_e2e main path passes. Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The vectors stage (the embedding phase of
init/increment/reprocess) underused the embedding model's batching.process_*_fileembedded each chunk serially, soSentenceTransformerEmbedder(@coco.fn.as_async(batching=True, max_batch_size=64)) only ever saw batches of 1 — batching only groups concurrently in-flight calls, and a serialawaitloop keeps one in flight. Worst forincrement(few changed files → no cross-file batching).Levers (
java_index_flow_lancedb.py)asyncio.gather(*(embedder.embed(ch.text) for ch in chunks))→ onemodel.encode([...])(up to 64) instead of N batch-of-1 encodes.asyncio.to_thread(_parse_and_enrich_java, ...)so the loop keeps the embedder's batching queue fed while a file parses (matters underinit's high file concurrency).splitter.splitstays inline (shared Rust singleton).Review-driven fixes (two code-review rounds)
ast_java.py:parse_javais now thread-safe via a per-threadParser(threading.local()). Round 1 found that Tier 1 completion plan + proposals (B2a + B4 + B5) #2 madeenrich_chunkconcurrent, and enrich transitively callsparse_java(collect_annotation_meta_chain → _collect_annotation_decl_index), which used the@lru_cachesingleton tree-sitterParser— not thread-safe. Root-cause fix at the source covers every caller (flow, enrich, graph builder). Addedtests/test_ast_java_thread_safety.py.app_main: warm per-project enrich caches before fanning out to threads. Round 2 found that concurrent enrich caused acollect_annotation_meta_chainthundering herd (up to ~14 worker threads each cold-missing and redundantly walk+parseing the whole project at cold start). Warming once on the event loop eliminates it.Correctness — verified clean
Two full review rounds (8 finder angles + 4 deep concurrency verifiers, with empirical 16-thread × 2000-parse stress tests). Zero remaining races: the per-thread
Parsercovers everyparse_javacall site; an exhaustive transitive walk ofenrich_chunk's callees found no other non-thread-safe shared state (pure functions, immutablefrozensetconstants, read-onlylru_cachereturns,ast/chunksnever mutated).declare_rowbuffers → commit is all-or-nothing via the cocoindex memo guard, so no partial writes on mid-file failure.Behaviour / compatibility
process_java_fileis@coco.fn(memo=True), changed body invalidates memo entries), then returns to incremental.Validation
ruff check .— clean..venv/bin/python -m pytest tests -q— 839 passed, 13 skipped (heavy-gated).JAVA_CODEBASE_RAG_RUN_HEAVY=1 ... test_lancedb_e2e.py— main path passes (fullcocoindex update --full-reprocess+ graph + MCPsearch).test_lancedb_ignore_file_reduces_indexed_java_filesfails — pre-existing onmaster(verified by stashing the change and reproducing the identicalFileExistsError; the test's ownmkdircollides with a tracked fixture file). Unrelated; left untouched per scope.🤖 Generated with Claude Code